Sixt Data Science Lab - Test Task for Data Scientist Job Candidates¶

Introduction¶

In this test task you will have an opportunity to demonstrate your skills of a Data Scientist from various angles - processing data, analyzing and vizalizing it, finding insights, applying predictive techniques and explaining your reasoning about it.

The task is based around a bike sharing dataset openly available at UCI Machine Learning Repository [1].

Please go through the steps below, build up the necessary code and comment on your choices.

Part 1 - Data Loading and Environment Preparation¶

Tasks:

  1. Prepare a Python 3 virtual environment (with virtualenv command). requirements.txt output of pip freeze command should be included as part of your submission.
  2. Load the data from UCI Repository and put it into the same folder with the notebook. The link to it is https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset . Here is an available mirror in case the above website is down: https://data.world/uci/bike-sharing-dataset
  3. We split the data into two parts. One dataset containing the last 30 days and one dataset with the rest.
In [1]:
import pandas as pd
import numpy as np

# read raw data
df_all = pd.read_csv('day.csv')

# split dataset
df_last30 = df_all.tail(30)
df = df_all.iloc[:-30, :]

df.head()
Out[1]:
instant dteday season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
0 1 2011-01-01 1 0 1 0 6 0 2 0.344167 0.363625 0.805833 0.160446 331 654 985
1 2 2011-01-02 1 0 1 0 0 0 2 0.363478 0.353739 0.696087 0.248539 131 670 801
2 3 2011-01-03 1 0 1 0 1 1 1 0.196364 0.189405 0.437273 0.248309 120 1229 1349
3 4 2011-01-04 1 0 1 0 2 1 1 0.200000 0.212122 0.590435 0.160296 108 1454 1562
4 5 2011-01-05 1 0 1 0 3 1 1 0.226957 0.229270 0.436957 0.186900 82 1518 1600

Part 2 - Data Processing and Analysis¶

Tasks:

  1. Perform all needed steps to load and clean the data. Please comment the major steps of your code.
  2. Visualise rentals of bikes per day.
  3. Assume that each bike has exactly maximum 12 rentals per day.
    • Find the maximum number of bicycles nmax that was needed in any one day.
    • Find the 95%-percentile of bicycles n95 that was needed in any one day.
  4. Visualize the distribution of the covered days depending on the number of available bicycles (e.g. nmax bicycles would cover 100% of days, n95 covers 95%, etc.)

Pat2 - Task 1 - Perform all needed steps to load and clean the data. Please comment on the major steps of your code.¶

Inspect the Data¶

In [2]:
# Check the shape of the dataset (rows, columns)
print(f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")

# Check for missing values
print("Missing values in each column:")
print(df.isnull().sum())

# Check the data types of each column
print("\nData types of each column:")
print(df.dtypes)
df.describe()
Dataset contains 701 rows and 16 columns.
Missing values in each column:
instant       0
dteday        0
season        0
yr            0
mnth          0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

Data types of each column:
instant         int64
dteday         object
season          int64
yr              int64
mnth            int64
holiday         int64
weekday         int64
workingday      int64
weathersit      int64
temp          float64
atemp         float64
hum           float64
windspeed     float64
casual          int64
registered      int64
cnt             int64
dtype: object
Out[2]:
instant season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
count 701.000000 701.000000 701.000000 701.000000 701.000000 701.000000 701.000000 701.000000 701.000000 701.000000 701.000000 701.000000 701.000000 701.000000 701.000000
mean 351.000000 2.479315 0.479315 6.285307 0.028531 3.004280 0.684736 1.385164 0.502732 0.480847 0.625717 0.190534 866.937233 3661.104137 4528.041369
std 202.505555 1.090839 0.499929 3.329294 0.166602 2.003207 0.464953 0.542489 0.182781 0.162584 0.141988 0.076740 693.470674 1553.467783 1939.766889
min 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.059130 0.079070 0.000000 0.022392 2.000000 20.000000 22.000000
25% 176.000000 2.000000 0.000000 3.000000 0.000000 1.000000 0.000000 1.000000 0.343478 0.348470 0.519167 0.134958 317.000000 2507.000000 3194.000000
50% 351.000000 2.000000 0.000000 6.000000 0.000000 3.000000 1.000000 1.000000 0.514167 0.503146 0.623750 0.182221 738.000000 3656.000000 4541.000000
75% 526.000000 3.000000 1.000000 9.000000 0.000000 5.000000 1.000000 2.000000 0.660000 0.613025 0.728750 0.233221 1135.000000 4739.000000 6041.000000
max 701.000000 4.000000 1.000000 12.000000 1.000000 6.000000 1.000000 3.000000 0.861667 0.840896 0.972500 0.507463 3410.000000 6946.000000 8714.000000

Handle Missing Values¶

In [3]:
# Since there are no missing values, no further action is needed here
# If there were missing values, we could fill them or drop the rows

Convert Data Types¶

In [4]:
# Convert 'dteday' to datetime
df['date'] = df['dteday'].apply(pd.to_datetime)
df = df.drop('dteday',  axis=1)
# Confirm the conversion
print("\nData type of 'dteday' after conversion:")
print(df['date'].dtype)
Data type of 'dteday' after conversion:
datetime64[ns]
C:\Users\Akhod\AppData\Local\Temp\ipykernel_11156\1264700120.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['date'] = df['dteday'].apply(pd.to_datetime)

Remove Unnecessary Columns¶

In [5]:
# Drop the'instant' as the column is just a record index, it is not necessary.
df = df.drop(columns=['instant'])

# Check the dataset after dropping the column
print("\nColumns after dropping 'instant':")
print(df.columns)
Columns after dropping 'instant':
Index(['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday',
       'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual',
       'registered', 'cnt', 'date'],
      dtype='object')

Handle Categorical Variables¶

In [6]:
# Convert 'season', 'weathersit', 'weekday', and 'mnth' to categorical data types
categorical_columns = ['season', 'weathersit', 'weekday', 'mnth', 'holiday', 'workingday', 'yr']

for col in categorical_columns:
    df[col] = df[col].astype('category')

# Confirm the conversion
print("\nData types after conversion to categorical:")
print(df.dtypes)
Data types after conversion to categorical:
season              category
yr                  category
mnth                category
holiday             category
weekday             category
workingday          category
weathersit          category
temp                 float64
atemp                float64
hum                  float64
windspeed            float64
casual                 int64
registered             int64
cnt                    int64
date          datetime64[ns]
dtype: object

Normalize or Scale Data¶

In [7]:
# The data is already normalized, so this step might not be necessary.

Final Inspection¶

In [8]:
# Tthe first five rows
df.head()

# Summary of the dataset
df.describe(include='all')
Out[8]:
season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt date
count 701.0 701.0 701.0 701.0 701.0 701.0 701.0 701.000000 701.000000 701.000000 701.000000 701.000000 701.000000 701.000000 701
unique 4.0 2.0 12.0 2.0 7.0 2.0 3.0 NaN NaN NaN NaN NaN NaN NaN NaN
top 3.0 0.0 1.0 0.0 6.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
freq 188.0 365.0 62.0 681.0 101.0 480.0 451.0 NaN NaN NaN NaN NaN NaN NaN NaN
mean NaN NaN NaN NaN NaN NaN NaN 0.502732 0.480847 0.625717 0.190534 866.937233 3661.104137 4528.041369 2011-12-17 00:00:00
min NaN NaN NaN NaN NaN NaN NaN 0.059130 0.079070 0.000000 0.022392 2.000000 20.000000 22.000000 2011-01-01 00:00:00
25% NaN NaN NaN NaN NaN NaN NaN 0.343478 0.348470 0.519167 0.134958 317.000000 2507.000000 3194.000000 2011-06-25 00:00:00
50% NaN NaN NaN NaN NaN NaN NaN 0.514167 0.503146 0.623750 0.182221 738.000000 3656.000000 4541.000000 2011-12-17 00:00:00
75% NaN NaN NaN NaN NaN NaN NaN 0.660000 0.613025 0.728750 0.233221 1135.000000 4739.000000 6041.000000 2012-06-09 00:00:00
max NaN NaN NaN NaN NaN NaN NaN 0.861667 0.840896 0.972500 0.507463 3410.000000 6946.000000 8714.000000 2012-12-01 00:00:00
std NaN NaN NaN NaN NaN NaN NaN 0.182781 0.162584 0.141988 0.076740 693.470674 1553.467783 1939.766889 NaN

Part 2 - Task 2 Visualise rentals of bikes per day.¶

In [9]:
df.set_index('date', inplace=True)
daily_rentals = df.cnt
In [10]:
import matplotlib.pyplot as plt

# Plot
plt.figure(figsize=(12, 6))
daily_rentals.plot(kind='line', color='g', linestyle='-', marker='o')
plt.title('Bike Rentals Per Day')
plt.xlabel('Date')
plt.ylabel('Number of Rentals')
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image
In [11]:
df.reset_index('date', inplace=True)
# df
In [12]:
# !pip install plotly_calplot
# Heatmap of total number of ride
from plotly_calplot import calplot
# creating the plot
fig = calplot(
         df,
         x='date',
         y='cnt'
)
fig.show()
plt.show()

Analysis of the Bike Rentals Per Day Chart¶

The charts show a clear annual trend in bike rentals. The number of rentals peaks in the summer months (June, July, and August) and is lowest in the winter months (December, January, and February). This seasonal pattern is likely due to favorable weather conditions during the summer months, making cycling more appealing.

Specific Observations:

  • Year-over-Year Comparison: There appears to be a slight increase in overall rentals from 2011 to 2012. This could be attributed to factors such as increased bike-sharing infrastructure, the growing popularity of cycling as a mode of transportation, or changes in weather patterns.
  • Seasonal Fluctuations: The seasonal peaks and troughs in rentals are consistent across both years. This indicates a strong correlation between weather and bike rental behavior.
  • Weekly Patterns: The heatmap chart shows some weekly patterns in bike rentals, with higher demand on weekdays and lower demand on weekends specifically on Sundays.
  • Outliers: A few data points deviate significantly from the overall trend. These could be because of extreme weather events, special events, or data collection errors.

Potential Factors Influencing Bike Rentals:

  • Weather: Temperature, precipitation, and daylight hours all influence bike rental demand.
  • Seasonality: Spring and summer are generally more favorable for cycling.
  • Special Events: Major events or holidays can increase or decrease bike rentals.
  • Infrastructure: The availability of bike lanes, parking facilities, and bike-friendly routes can impact rental numbers.
  • Economic Conditions: Economic factors such as unemployment rates and fuel prices can influence people's choice of transportation.

Suggestions:

To gain a deeper understanding of bike rental trends, additional analysis could be conducted, such as:

  • Correlating rentals with weather data: This would help quantify the impact of weather on bike usage.
  • Examining weekly and daily patterns: Analyzing rentals on a finer time scale can reveal more detailed insights.

Part 2 - Task 3. Assume that each bike has exactly maximum 12 rentals per day.¶

* Find the maximum number of bicycles `nmax` that was needed in any one day.
* Find the 95%-percentile of bicycles `n95` that was needed in any one day.
In [13]:
# Maximum rentals per day per bike
max_rentals_per_bike = 12

# Calculate the number of bikes needed each day
bikes_needed = np.ceil(daily_rentals / max_rentals_per_bike).astype(int)
df['bikes_needed'] = list(bikes_needed)

# Calculate the maximum number of bikes needed on any one day, which indicates the peak demand for bicycles on any single day.
nmax = bikes_needed.max()

# Calculate the 95th percentile of the number of bikes needed
# This value indicates that 95% of the days had a demand for bicycles less than or equal to this number.
n95 = int(np.percentile(bikes_needed, 95))

print(f"Maximum number of bicycles needed on any one day: {nmax}")
print(f"95th percentile of bicycles needed on any one day: {n95}")
# df
Maximum number of bicycles needed on any one day: 727
95th percentile of bicycles needed on any one day: 632

Part 2 - Task 4 - Visualize the distribution of the covered days depending on the number of available bicycles (e.g. nmax bicycles would cover 100% of days, n95 covers 95%, etc.)¶

In [14]:
# Define bin edges (e.g., bins of size 5)
bin_edges = np.arange(0, bikes_needed.max() + 10, step=5)  # Adjust step size as needed

# Calculate the percentage of days covered by each number of bikes
coverage_percentages = []
for ege in bin_edges :
    coverage_percentage = (bikes_needed.values <= ege).mean() * 100
    coverage_percentages.append(coverage_percentage)
In [15]:
# Plot
plt.figure(figsize=(12, 6))
plt.plot(bin_edges, coverage_percentages, color='b')
plt.title('Coverage of Days Depending on Number of Available Bicycles (Binned)')
plt.xlabel('Number of Bicycles')
plt.ylabel('Percentage of Days Covered')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
No description has been provided for this image

Analysis of the Chart: Coverage of Days Depending on Number of Available Bicycles¶

The chart displays a clear upward trend, indicating that as the number of available bicycles increases, the percentage of days with sufficient coverage also increases. This reveals the existence of a positive correlation between the availability of bicycles and the ability to fulfill the demand.

Specific Observations:

  • Initial Plateau: The graph starts with a relatively flat section, suggesting that even with a small number of bicycles, a certain percentage of days can be covered. This might be due to factors like low demand on some days or the ability to redistribute bicycles from areas with extra supply.
  • Steadily Increasing Coverage: As the number of bicycles increases, the coverage rate rises steadily.
  • Asymptotic Behavior: The curve seems to approach a maximum value, suggesting that there might be a point of diminishing returns where adding more bicycles doesn't significantly increase coverage. This could be due to factors like congestion of demand or limitations in infrastructure or operational capacity (If the bike supply data was available, we could provide more insight into the market capacity).

Potential Implications:

  • Optimal Bicycle Inventory: The chart can help determine the optimal number of bicycles to maintain based on desired coverage levels. By identifying the point where the curve starts to plateau, organizations can assess whether the additional cost of more bicycles justifies the increase in coverage.
  • Demand Forecasting: The data can be used to forecast future demand and adjust bicycle inventory accordingly. If demand is expected to increase, organizations can proactively add more bicycles to maintain coverage.
  • Infrastructure Planning: The chart can inform decisions about expanding bike-sharing infrastructure, such as adding more docking stations or increasing the fleet size. By understanding the relationship between bicycle availability and coverage, organizations can make data-driven decisions about resource allocation.

Further Analysis:

To gain a deeper understanding of the relationship between bicycle availability and coverage, additional analysis could be conducted, such as:

  • Seasonal Variations: Examining the data for seasonal patterns can help identify periods of higher or lower demand.
  • Geographic Factors: Analyzing coverage by location can reveal variations in demand across different areas.
  • Weather Impacts: Investigating the relationship between weather conditions and bicycle usage can help identify factors that influence demand.
  • Competitive Analysis: Comparing the data with information about competing bike-sharing services can provide insights into market dynamics.

Further insights¶

Correlation table¶

In [16]:
import seaborn as sns
cor = df.corr()
plt.figure(figsize=(12, 12))
sns.heatmap(cor, cmap=plt.cm.CMRmap_r,annot=True)
plt.show()
cor['cnt']
No description has been provided for this image
Out[16]:
date            0.706355
season          0.389125
yr              0.603446
mnth            0.323326
holiday        -0.058080
weekday         0.065928
workingday      0.048162
weathersit     -0.288684
temp            0.631649
atemp           0.633885
hum            -0.101409
windspeed      -0.234871
casual          0.680335
registered      0.944966
cnt             1.000000
bikes_needed    0.999999
Name: cnt, dtype: float64
Analysing the linear correlation heatmap diagram¶

Strong Positive Correlations

  • Temp (0.631649) and Atemp (0.633885): Both actual temperature and "feels-like" temperature have a strong positive correlation with the rental count, indicating that warmer temperatures lead to higher bike usage.
  • Year (0.603446): A positive correlation with the year implies an increasing trend in bike rentals year over year, likely due to the service growing in popularity or geographic expansion (As we have only two values for the year column, the correlation is not significant and reliable).

Weak Positive Correlations

  • Weekday (0.065928) and Working Day (0.048162): The very weak positive correlations suggest that the day of the week or whether it is a working day has little impact on the total rental count. Rentals might be fairly consistent throughout the week.

Negative Correlations

  • Holiday (-0.058080): This very weak negative correlation suggests that rentals slightly decrease on holidays, but the effect is minimal.
  • Humidity (-0.101409): A weak negative correlation with humidity suggests that higher humidity slightly discourages bike rentals, but the effect is small.

Moderate Negative Correlations

  • Windspeed (-0.234871): A moderate negative correlation with wind speed indicates that stronger winds are associated with fewer bike rentals, which again is understandable as windy conditions can make biking less enjoyable or more difficult.
  • Weather Situation (-0.288684): A moderate negative correlation with weather conditions indicates that worse weather (like rain or snow) leads to fewer bike rentals, which is intuitive.

Summary

  • Most Influential Factors: temperature.
  • Negative Influences: Weather conditions like worse weather, higher wind speeds, and, with lower impact, higher humidity negatively impact the number of rentals.
  • Minimal Impact: Holidays, and whether it's a working day seem to have little effect on the overall rental count.

Part 3 - Building prediction models¶

Tasks:

  1. Define a test metric for predicting the daily demand for bike sharing, which you would like to use to measure the accuracy of the constructed models, and explain your choice.
  2. Build a demand prediction model with Random Forest, preferably making use of following python libraries: scikit-learn.
  3. Report the value of the chosen test metric on the provided data.

Part 3- Task 1- Define a test metric for predicting the daily demand for bike sharing, which you would like to use to measure the accuracy of the constructed models, and explain your choice.¶

Test Metric for Predicting Daily Bike-Sharing Demand

To measure the accuracy of prediction models for daily bike-sharing demand, the Root Mean Squared Error (RMSE) is a suitable test metric. Here's why RMSE is a strong choice:

Definition of RMSE The Root Mean Squared Error (RMSE) is defined as:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

where:

  • $n$ is the number of observations (days in this context),
  • $y_i$ is the actual demand (i.e., the actual count of bike rentals on day $i$),
  • $\hat{y}_i$ is the predicted demand by the model for day $i$.

Why RMSE? RMSE is chosen as the test metric for predicting daily bike-sharing demand because it effectively balances the need to penalize large errors, provides an intuitive and interpretable measure of accuracy, and is a widely accepted standard in regression analysis. By minimizing RMSE, we aim to develop a model that predicts bike demand as accurately as possible, ensuring efficient and effective management of the bike-sharing service.

Comparison to Other Metrics

  • Mean Absolute Error (MAE): MAE is another common metric but is less sensitive to outliers compared to RMSE. In cases to penalize large errors more (as in this case), RMSE is more appropriate.

  • Mean Squared Error (MSE): While MSE is the precursor to RMSE, its value is in squared units of the target variable, making it harder to interpret. RMSE resolves this by taking the square root.

  • R-squared ($R^2$): $R^2$ provides a relative measure of fit, indicating the proportion of variance explained by the model. While useful for understanding overall model performance, it doesn't provide the same level of direct, interpretable insight into the magnitude of prediction errors as RMSE does.

Part 3- Task 2- Build a demand prediction model with Random Forest, preferably making use of following python libraries: scikit-learn.¶

In [17]:
df.columns
Out[17]:
Index(['date', 'season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday',
       'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual',
       'registered', 'cnt', 'bikes_needed'],
      dtype='object')
In [18]:
# Some non-numerical columns like 'date' must be removed for modeling.
# Columns of 'casual', 'registered', and 'bikes_needed' are yielded from the target or the target is yielded from them so that they will be removed
to_drop = ['date', 'casual', 'registered', 'bikes_needed']
df.drop(to_drop, axis=1, inplace=True)
df.head()
Out[18]:
season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed cnt
0 1 0 1 0 6 0 2 0.344167 0.363625 0.805833 0.160446 985
1 1 0 1 0 0 0 2 0.363478 0.353739 0.696087 0.248539 801
2 1 0 1 0 1 1 1 0.196364 0.189405 0.437273 0.248309 1349
3 1 0 1 0 2 1 1 0.200000 0.212122 0.590435 0.160296 1562
4 1 0 1 0 3 1 1 0.226957 0.229270 0.436957 0.186900 1600

Fitting the RF model¶

In [19]:
# A simple Random forest
from sklearn.ensemble import RandomForestRegressor
X, y = df.iloc[:,:-1], df.iloc[:,-1]
RFreg_simple = RandomForestRegressor()
RFreg_simple.fit(X, y)
Out[19]:
RandomForestRegressor()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestRegressor()
In [20]:
# The columns that weres deleted in training must be eliminated in testing
df_test = df_last30.drop(['instant', 'dteday', 'casual', 'registered'], axis=1)

Part 3- task 3- Report the value of the chosen test metric on the provided data¶

Evaluating the model¶

In [21]:
# Evaluation for the simple random forest
from sklearn.metrics import *

X_test , y_test = df_test.iloc[:,:-1], df_test.iloc[:,-1]
simpleRF_prediction = RFreg_simple.predict(X_test )

print('RMSE_train:', root_mean_squared_error(y, RFreg_simple.predict(X)))
print('RMSE_test:', root_mean_squared_error(y_test, simpleRF_prediction))
print('MSE of test:', mean_squared_error(y_test, simpleRF_prediction))
print('MAE of test:', mean_absolute_error(y_test, simpleRF_prediction))

print("Test R^2:", r2_score(y_test, simpleRF_prediction))
RMSE_train: 242.69291532026153
RMSE_test: 1143.3229617304114
MSE of test: 1307187.3948199996
MAE of test: 895.3226666666666
Test R^2: 0.5917602875761707

RMSE shows a high level of error. Maybe Tuning the regressor model improves the error

Part 4 - Reflection / comments¶

Tasks:

(Optional) Please share with us any free form reflection, comments or feedback you have in the context of this test task.

Part 4- Hyper parameter optimization of RF¶

Please do not run the following cell if you do not want to wait a few hours :))

In [138]:
'Random Forest'

from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import GridSearchCV
## Using grid search
rf = RandomForestRegressor()

# # The function to measure the quality of a split.
criterion = [None, 'friedman_mse', 'poisson','squared_error', 'absolute_error' ]
# Number of trees in random forest
n_estimators = list(range(50,400,50))
# Number of features to consider at every split
max_features = [None,  2, 3, 4, 5, 6, 7, 8,9,10, 11, 'auto', 'sqrt', 'log2']
# Maximum number of levels in tree
max_depth = list(range(3, 15))
max_depth.append(None)
# Minimum number of samples required to split a node
# min_samples_split = [2, 4, 6, 8, 10]
# Minimum number of samples required at each leaf node
# min_samples_leaf = [1, 2, 3, 4]

# Create the random grid
random_grid = {
                'criterion': criterion,
                'n_estimators': n_estimators,
                'max_features': max_features,
                'max_depth': max_depth,
                # 'min_samples_split': min_samples_split,
                # 'min_samples_leaf': min_samples_leaf,
}

#The data is time-series so we need to use a time-series K-fold split
tscv = TimeSeriesSplit(n_splits=3)
tuning_model=GridSearchCV(rf, param_grid=random_grid, n_jobs=-1,
                          scoring='neg_root_mean_squared_error', cv=tscv, verbose=1, return_train_score=True)

tuning_model.fit(X, y)
# best hyperparameters 
print('Best params:', tuning_model.best_params_)
Fitting 3 folds for each of 4368 candidates, totalling 13104 fits
C:\Program Files\Python310\lib\site-packages\sklearn\model_selection\_validation.py:547: FitFailedWarning:


1008 fits failed out of a total of 13104.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
634 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Program Files\Python310\lib\site-packages\sklearn\base.py", line 1467, in wrapper
    estimator._validate_params()
  File "C:\Program Files\Python310\lib\site-packages\sklearn\base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "C:\Program Files\Python310\lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'max_features' parameter of RandomForestRegressor must be an int in the range [1, inf), a float in the range (0.0, 1.0], a str among {'log2', 'sqrt'} or None. Got 'auto' instead.

--------------------------------------------------------------------------------
374 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Program Files\Python310\lib\site-packages\sklearn\base.py", line 1467, in wrapper
    estimator._validate_params()
  File "C:\Program Files\Python310\lib\site-packages\sklearn\base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "C:\Program Files\Python310\lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'max_features' parameter of RandomForestRegressor must be an int in the range [1, inf), a float in the range (0.0, 1.0], a str among {'sqrt', 'log2'} or None. Got 'auto' instead.


C:\Program Files\Python310\lib\site-packages\sklearn\model_selection\_search.py:1051: UserWarning:

One or more of the test scores are non-finite: [-1668.23860228 -1636.20084943 -1644.36404681 ... -1388.85966757
 -1368.65889195 -1374.33253737]

C:\Program Files\Python310\lib\site-packages\sklearn\model_selection\_search.py:1051: UserWarning:

One or more of the train scores are non-finite: [-666.46585393 -632.16682121 -644.91144761 ... -205.95360296 -206.22356903
 -206.4015325 ]

Best params: {'criterion': 'absolute_error', 'max_depth': 12, 'max_features': 5, 'n_estimators': 100}
In [141]:
results = []
results = pd.DataFrame.from_dict(tuning_model.cv_results_)

train_score= results['mean_train_score']
train_score_std= results['std_train_score']
cv_score = results['mean_test_score'] 
cv_score_std= results['std_test_score']
param_max_depth = results['param_max_depth']

print(results.shape)
results.head()
(4368, 20)
Out[141]:
mean_fit_time std_fit_time mean_score_time std_score_time param_criterion param_max_depth param_max_features param_n_estimators params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score split0_train_score split1_train_score split2_train_score mean_train_score std_train_score
0 0.076591 0.002868 0.003336 4.717188e-03 friedman_mse 3 2 50 {'criterion': 'friedman_mse', 'max_depth': 3, ... -874.400826 -2163.260733 -1967.054247 -1668.238602 567.014437 4030 -470.017185 -616.310180 -913.070196 -666.465854 184.319806
1 0.151746 0.007192 0.005208 7.364684e-03 friedman_mse 3 2 100 {'criterion': 'friedman_mse', 'max_depth': 3, ... -874.024867 -2164.108577 -1870.469105 -1636.200849 552.111224 4016 -430.341695 -591.963150 -874.195618 -632.166821 183.419060
2 0.224064 0.007274 0.015623 1.123916e-07 friedman_mse 3 2 150 {'criterion': 'friedman_mse', 'max_depth': 3, ... -871.098234 -2157.454362 -1904.539545 -1644.364047 556.444959 4024 -441.387972 -600.095792 -893.250579 -644.911448 187.174226
3 0.299290 0.013429 0.014777 1.196802e-03 friedman_mse 3 2 200 {'criterion': 'friedman_mse', 'max_depth': 3, ... -856.963866 -2154.028673 -1931.999919 -1647.664152 566.409392 4026 -446.831962 -593.749770 -881.394468 -640.658733 180.483575
4 0.372235 0.013515 0.014155 8.067228e-03 friedman_mse 3 2 250 {'criterion': 'friedman_mse', 'max_depth': 3, ... -828.944064 -2156.715865 -1838.925587 -1608.195172 566.081197 3988 -442.465594 -599.835763 -873.659792 -638.653716 178.161414
In [192]:
plt.rcParams["figure.figsize"] = [9, 6]
plt.rcParams["figure.autolayout"] = True
plt.plot(param_max_depth.astype(float), np.abs(train_score), label = 'Train')
plt.plot(param_max_depth.astype(float), np.abs(cv_score), label = 'Validation')
plt.title('Average Training error in Grid Search CV vs. Max_depth for both validation and training phases')
plt.xlabel('max_depth')
plt.ylabel('Mean train error')
plt.legend()
plt.show()
No description has been provided for this image

Analyzing the Training Score vs. Max_Depth Plot

The plot shows on aspect of hyperparameter tuning with Grid Search CV. It visualizes the relationship between the average training score and the max_depth hyperparameter for both the training and validation phases of Random Forest model.

Key Observations:

  • Training Score: The training error, represented by the blue line, generally decreases as the max_depth increases. This is expected because deeper trees can capture more complex patterns in the training data, leading to better performance on the training set.
  • Validation Score: The validation score, represented by the orange line, initially decreases with max_depth but eventually starts to increase. This indicates that while deeper trees perform well on the training data, they might be overfitting, leading to poorer generalization performance on unseen data.
  • Overfitting: The gap between the training and validation scores widens as max_depth increases. This suggests that the model is becoming increasingly overfit, learning the training data too well and failing to generalize to new data.
  • Optimal Max_Depth: The optimal max_depth is likely the point where the validation score is lowest. This is the point where the model generalizes best to unseen data without overfitting.

Part 4- Fiting the best model¶

In [209]:
# The parameters of the model is set based on tuning_model.best_params_
# Best params: {'criterion': 'absolute_error', 'max_depth': 12, 'max_features': 5, 'n_estimators': 100}

Best_RF_model = RandomForestRegressor(criterion='absolute_error', max_depth= 12, max_features=5,
                                     n_estimators= 100, n_jobs=-1)

Best_RF_model.fit(X, y)
Out[209]:
RandomForestRegressor(criterion='absolute_error', max_depth=12, max_features=5,
                      n_jobs=-1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestRegressor(criterion='absolute_error', max_depth=12, max_features=5,
                      n_jobs=-1)

Part 4- Evaluating the best model¶

In [210]:
# prediction 
predict = Best_RF_model.predict(X_test)
train_predict = Best_RF_model.predict(X)
In [211]:
###Metrics
print('RMSE_train:', root_mean_squared_error(y, train_predict))
print('RMSE_test:', root_mean_squared_error(y_test, predict))
print('MSE of test:', mean_squared_error(y_test, predict))
print('MAE of test:', mean_absolute_error(y_test, predict))
print("Test R^2:", r2_score(y_test, predict))
RMSE_train: 273.1367232023141
RMSE_test: 1037.2987758645047
MSE of test: 1075988.75041
MAE of test: 809.3443333333332
Test R^2: 0.6639645243066773

The model metrics did not change a lot.

This means either:

  • Random Forest cannot capture seasonality, remove variance errors, or capture time-related patterns. These make it not a proper model for this problem or
  • Hyperparameter tuning was not enough deep and granular!

Suggestion:

  • Applying models that are designed for time-series data.

    1. Applying LSTM, RNN, or GRU model to this data. In this case, collecting and providing more data is necessary to fit the DNNs.
    2. Models like ARIMA and FBProphet can be used to deal with low-number data.
  • Applying more advanced tuning on the RF model to achieve a better result.

    This approach is not recommended Since we already have overfiting (RMSE_train =273 and RMSE_test = 1037).

Overall outcome:

Considering all the circumstances, Applying the ARIMA model, FBPropghet or similar methods are the potential solutions.

In [186]:
# Visualizing y_test vs. prediction
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = [15, 8]
plt.rcParams["figure.autolayout"] = True

x = list(range(len(y_test)))
y1 = predict
y2 = y_test

# plt.legend()
plt.subplot(211)
plt.title('Prediction Vs. Actual value')
plt.xlabel('index')
plt.ylabel('Number of rides')
plt.plot(x, y1, color='red', lw=5, label='Ride prediction')
plt.plot(x, y2, color='Blue', lw=3, label='Actual number of rides')
plt.legend()
Out[186]:
<matplotlib.legend.Legend at 0x164e5d3d390>
No description has been provided for this image

Analysis of the Ride Prediction Chart

The chart compares predicted ride numbers to actual bike ride numbers daily. The general trend indicates that the prediction model is moderately accurate, with the predicted values following the overall shape of the actual values.

Specific Observations:

  • Correlation: There's a clear correlation between the predicted and actual values, suggesting that the model is capturing some underlying patterns in the data.
  • Underestimation and Overestimation: In certain periods, the model underestimates the actual number of rides, while in others it overestimates. This indicates that the model's accuracy varies over time.
  • Lag: There might be a slight lag between the predicted and actual values. This could be due to factors like the model's training data or the inherent time delay in the system.
  • Outliers: A few data points deviate significantly from the overall trend. These could be due to unusual events or anomalies in the data.
In [189]:
Error = pd.DataFrame(list(predict - y_test), columns=['Ride Error distribution'])
g = sns.displot(Error, kde=True, kind="hist", bins=20)
g.set_titles("Error distribution")
Error
Out[189]:
Ride Error distribution
0 -1595.715
1 52.495
2 35.360
3 659.840
4 -1507.250
5 -301.305
6 -2246.790
7 217.915
8 -999.320
9 -412.075
10 -145.020
11 -314.990
12 -442.040
13 258.100
14 -607.625
15 -937.865
16 288.815
17 36.080
18 1051.495
19 344.395
20 1340.040
21 1455.160
22 2156.880
23 2626.990
24 2131.260
25 992.680
26 370.870
27 1707.290
28 1088.540
29 491.890
No description has been provided for this image
In [190]:
sns.distplot(Error, kde=True, bins=20)
C:\Users\Akhod\AppData\Local\Temp\ipykernel_23292\1523101246.py:1: UserWarning:



`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751


Out[190]:
<Axes: ylabel='Density'>
No description has been provided for this image

Analyzing the two Ride Error Histograms

The first histogram presents the distribution of ride errors, representing the difference between predicted and actual ride values. The x-axis shows the error value group intervals for the number of rides, which means that the errors are categorized into bins based on the number of rides predicted or actual. The second one is the same histogram while the y-coordinate indicates the density of each bin.

  • Distribution Shape: The histogram indicates a roughly normal distribution of ride errors, with a central peak and symmetrical tails. This suggests that the model's predictions are generally accurate, with errors distributed relatively evenly around the mean.
  • Skewness: There's a slight leftward skew to the distribution, indicating that there are slightly more negative errors (underestimations) than positive errors (overestimations).
  • Outliers: A few outliers are visible, particularly on the left side of the distribution. These represent extreme cases where the model significantly underestimated ride values.
  • Central Tendency: The majority of errors are clustered around the 0 mark, indicating that the model's predictions are generally accurate.
  • Spread: The distribution is relatively spread out, suggesting that there is some variability in the model's errors.

Permutation Test for Model Significance¶

Conduct a permutation test to assess the statistical significance of the tuned RF model.

For calculation of P_value The link is used:

In [202]:
from sklearn.utils import shuffle

# Step 1: Train the actual model and calculate % variance explained (R-squared)
Best_RF_model.fit(X, y)
y_pred = Best_RF_model.predict(X_test)
actual_r2 = r2_score(y_test, y_pred)

# Step 2: Permute the dependent variable and calculate % variance explained repeatedly
n_permutations = 1000  # By increasing this to 10,000 or more better accuracy is expected.
permuted_r2_scores = []

for _ in range(n_permutations):
    # Permute the dependent variable
    y_train_permuted = np.random.permutation(y)
    
    # Fit a Random Forest model on the permuted data
    Best_RF_model.fit(X, y_train_permuted)
    
    # Predict and calculate the R-squared
    y_pred_permuted = Best_RF_model.predict(X_test)
    permuted_r2 = r2_score(y_test, y_pred_permuted)
    permuted_r2_scores.append(permuted_r2)

# Step 3: Calculate the p-value
permuted_r2_scores = np.array(permuted_r2_scores)
p_value = np.sum(permuted_r2_scores >= actual_r2) / n_permutations

# Output the results
print(f"Actual R-squared: {actual_r2:.4f}")
print(f"P-value: {p_value:.4f}")
Actual R-squared: 0.6684
P-value: 0.0000

The P_value is almost zero

This indicates that the provided model is statistically significant, which means the map (e.g. model's weights or tree stucture) between features and the target is not random or arbitary. This model is statistically more than $99\%$ reliable.

Feature importance¶

In [214]:
importances = Best_RF_model.feature_importances_
indices = np.argsort(importances)[::-1]  # Sort the feature importances in descending order

# Names of the features
features = X.columns if hasattr(X, 'columns') else [f'Feature {i}' for i in range(X.shape[1])]

# Create the plot
# Bar Plot of Feature Importances
plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), importances[indices], align="center")
plt.xticks(range(X.shape[1]), [features[i] for i in indices], rotation=90)
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.show()
No description has been provided for this image

Analyzing the Feature Importance Bar Chart

  • Most Important Feature: The year feature is the most important feature according to the chart, contributing significantly to the model's predictions.
  • Temperature-Related Features: Features related to temperature ("temp" and "atemp") also appear to be quite important, suggesting that temperature plays a significant role in the number of rides variable.
  • Humidity and Month: Humidity and month are also moderately influence on the model's predictions.
  • Seasonal and Weather-Related Features: Features like "season," "windspeed," "weathersit," and "weekday" have lower importance scores, suggesting that they might be less influential or have more complex relationships with the target variable.
  • Working Day and Holiday: The features "workingday" and "holiday" have the lowest importance scores, indicating that they have minimal impact on the model's predictions. They can be a good candidate for feature elimination in the Feature Selection phase.

Answers / comments / reasoning:

Submission¶

Please submit this notebook with your developments in .ipynb and .html formats as well as your requirements.txt file.

References¶

[1] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.